ABSTRACT

Current practice in language testing has not yet 
integrated classical test theory with assessment of language skills. 
In addition, language testing needs to be part of theory development. 
Lack of sound testing procedures can lead to problems in research 
design and ultimately, inappropriate theory development. The debate 
over dimensionality of language and the testing of proficiency 
illustrates these difficulties. The introduction of confirmatory 
analysis should improve research on second language learning. In this 
paper the confirmatory use of a latent trait model (the "partial 
credit model") is demonstrated as a tool in the development and 
construct validation of an oral interview test. The model describes 
the relationship between an individual's proficiency and the 
difficulty of a language task, allowing for at least two categories 
of performance, in terms of the probability of a person providing a 
language sample adequate to earn a given score within a given limit. 
It was chosen because of its apparent consistency with observations 
over a wide range of classroom activities. Item analysis and 
model-to-data fit were conducted on a 29-item interview test given to 
270 students. Use of the approach and model was found to be 
appropriate and valid. A 33-item bibliography is included. (MSE) 
INTRODUCTION



Approaches to the testing of second language development have followed 
teaching methodologies. In testing as in teaching, there have been swift 
changes from one methodology to another, with the proponents of each method 
denouncing the validity of all preceding methods. 

There would appear to be at least two aspects to many of the problems in 
language testing. The first concerns the limitations of classical test theory 
for the construction and validation of spoken language tests. Current 
practices suggest that integrated tests which attempt to assess language 
skills and classical test theory have been unable to provide a technology 
appropriate to this need. The most authentic and direct of the integrative 
approaches have generally required an interview format along with a method of 
judging samples of elicited language according to the degree of accuracy, 
authenticity and acceptability. In this type of test it is more natural to 
grade a student's response in a number of categories according to its degree 
of acceptability in a given situation. But while these approaches have been 
regarded as the most valid from both a theoretical and a practical 
perspective, classical test theory, which was developed for use with 
dichotomously scored ('correct'/'incorrect'), discrete point test items 
administered with paper and pencil, has limited applications in integrative, 
authentic language testing. For example, according to Harrison (1983), 

New developments in the theory of language testing since Lado have 
been slow because the production and statistical justification of 
multiple choice tests has made other more subjective assessments look 
weak by comparison. (Harrison, 1983: 84) 

A second aspect of the problems, possibly arising from the first, is that 
tests and other means of measuring second language proficiency are generally 
used to define variables used in empirical research studies. These studies 
have aimed at exploring and developing theories of language development, but 
have been hampered by potential weaknesses in the original measure, based in 
classical test theory. Stevenson (1985) points out that the difficulties in 
theory development may arise from a failure to recognise that theory 
development and testing must work together to further language research. 
Language testing needs to be part of theory development. 



As such, empirical studies depend on variables which have direction, scale and 
definite metric properties. When correlational techniques are employed, as 
in factor and other analyses, the metric requirements are quite stringent. A 
lack of sound testing procedures can therefore lead to problems in research 
design and ultimately to inappropriate theory development. No example is more 
common in Second Language Acquisition (SLA) research than the debate over 
dimensionality of language. 

DIMENSIONALITY: Exploratory Studies 

Numerous definitions and discussions regarding the dimensionality of 
proficiency and communicative competence exist (e.g., Canale and Swain, 1980; 
Oller, 1983; Hughes and Porter, 1983; Higgs, 1984; James, 1985; Rivera, 1985) 
and we do not wish to enter this debate. It remains as an issue that has been 
at the heart of a range of controversies in second language acquisition (SLA) 
research. Arguments in the language dimensionality debate range from a denial 
of any dimensions of proficiency (Pienemann and Johnston, 1985), the 
hypothesis of a unitary dimension (e.g. Oller, 1983), to a divisible dimension 
or multi-dimensional models (e.g. Farhady, 1983). Although it has now been 
widely accepted that neither the unitary nor the divisible dimension 
hypotheses can be defended in their extreme forms, a comment on the research 
methodology employed in the debate is worthwhile since the various sides taken 
appear to be based on statistical and research design bases which might be 
questionable. 

Even the term 'proficiency' has been challenged in the literature (Johnston 
and Pienemann, 1985). However, it is generally accepted as a descriptive term 
and we have chosen to accept it and adopt it as a developmental description of 
an individual's relative status in terms of growth of language utility. As many 
researchers have accepted the use of the term, the debate has focussed on the 
nature of proficiency development, and whether there is even such a thing as 
dimensions. Davies (1981) has also made this point. 

Johnston and Pienemann's (1985) argument that there are no dimensions of 
proficiency is difficult to address, as even they report developmental and 
variational dimensions. Their developmental dimension is used to illustrate 
their notion of an implicational relationship. Under these circumstances it 
is difficult to rationalise zero dimensions in language development or 
proficiency. 
In quantitative approaches an underlying mathematical model of analysis is 
selected. A collection of measures is then used to demonstrate the validity 
of that mathematical modelling of language development. In many cases the 
specification of the model is given little or no attention and it is tested 
with data of unknown measurement properties. It is probably safe to conclude 
that the underlying mathematical model in the statistical analysis is rarely, 
if ever, considered. If the model or equation underlying the factor analysis, 
ANOVA, regression analysis, or other analyses were written down and examined 
as a definition of the way language develops, there would be few who agreed 
with its appropriateness. However, these analyses are very common. While 
there is a considerable amount of quality theory generation in the area of 
language acquisition, the methodology used to test these theories is not of 
equivalent quality. 

In factor analytic studies used to demonstrate dimensionality, it is common 
for an exploratory principal component analysis or principal factor analysis 
to be used, with a varimax rotation (e.g. Farhady, 1983). As this procedure 
is specifically designed to identify multiple factors, which are independent 
and maximally separated, the discovery of multiple dimensions is not 
surprising. The indeterminacy of factor analysis virtually assures the 
researcher that a factor solution will be found. The nature of the solution 
depends on the technique used to obtain the simple structure. 

When measures of a common type are used, it is not surprising that a single 
dimension is identified. The possibility for this is clearly demonstrated in 
a multi-trait, multi-method study by Bachman and Palmer (1983). Furthermore, 
when small case studies involving very few subjects are employed, it is not 
surprising that no dimensions can be identified, since the number of cases 
may only just exceed the number of variables, and hence small and unstable 
eigenvalues tend to indicate a lack of specific or general factors. The 
problem is a statistical one rather than a substantive one, and the nature of 
identified dimensions in language proficiency development seems to be based on 
statistical reasoning which, by and large, predetermines the outcome in 
support of one or another type of theory (Vollmer and Sang, 1983). It is 
remarkable that the independence of factors, whether single or multiple, is 
interpreted in dimensional terms. If multiple independent dimensions do 
exist, then it should be possible to develop teaching programmes around each 
factor completely isolated and unrelated to programmes for other factors or 
dimensions. There are not many practitioners who would accept this, but there 
are numerous research studies which conclude that the independence of factors 
is strongly supported by the evidence obtained. The single or multiple 
factors appear to be manufactured by the analytical methods and the 
measurements used (Carroll, 1983). 

Other non factor-based studies posit theories of language development and 
proficiency which are based on very small samples (e.g., Schumann, 1975; 
Krashen, 1977; Pienemann and Johnston, 1985). The argument that five, ten or 
even twenty cases producing thousands of utterances for analysis constitute a 
large data base from which generalisable results are obtainable is 
indefensible. There is no doubt that this kind of intensive casework is 
essential in theory development, but there is a need for more thorough theory 
testing before external validity can be claimed. Studies based on very small 
samples tend to yield broad generalisations beyond their external validity. 

Problems in research design and the inappropriate use of statistical 
techniques are not restricted to the language dimensionality debate. For 
example, Willig (1985), in a review of research in bilingual education, 
commented on the generally poor quality of research in the language area, and 
the problems that this poses for interpreting the results of many studies. 

The overwhelming message of these findings reflects on the quality of 
research and evaluation in bilingual education. The unacceptable 
quality of the major portion of this research is substantiated not 
only by the information contained in the studies, but also by that 
not contained in the studies ... Even the kinds of information basic 
for any reputable research report were frequently missing ... It is 
imperative that the quality of research and evaluation in bilingual 
education be upgraded. (Willig, 1985: 311) 



CONFIRMATORY APPROACHES 

The introduction of confirmatory analysis methods into the language area (e.g. 
Bachman and Palmer, 1983) will play an important role in the improvement of 
research in the language area, and in particular, theory testing undertaken in 
SLA research (Stevenson, 1985). It is important however to recognise that the 
value of this type of analysis will depend on the quality of the underlying 
measure used in the analysis. It is at the level of constructing these 
measures that this study is primarily concerned. In this paper the 
confirmatory use of a latent trait model (the 'partial credit model') is 
demonstrated as a tool in the development and construct validation of an oral 
interview test. It is shown that, following the proposal of a dimension in 
language proficiency, the use of a dimension-based measurement model such as 
the 'partial credit model' is appropriate and is capable of beginning the 
confirmatory analysis. If more than one dimension is thought to exist, each 
may be defined, constructed and then tested using a dimensional model. These 
steps should be a priority to any further empirical studies. 



SELECTING A DIMENSION 

One language development or acquisition model is selected for this study 
because of its apparent consistency with observations over a wide range of 
classroom activities, and its ubiquitous nature in literature dealing with 
language development. This by no means implies that it was taken as a given. 
The structure of language indeed seemed to be the basis of a division of 
theory and practice in second language acquisition. 

The differences observed in over 60 classroom lessons, the variety of courses, 
techniques and contexts meant that language acquisition or development models 
based in achievement of course-specific objectives would not be appropriate 
for large scale testing. Teacher interviews, analysis of course records, 
reports and syllabi, coupled with the classroom observations and an extensive 
review of theoretical literature, suggested that at least one general 
development area was of general concern. This was the area of language 
structure or grammar. Details of the identification of this area are given in 
Griffin, Adams, Martin and Tomlinson (1986). Thus, the study focussed on the 
development of grammatical skills within various contexts. The aim was then 
clarified to develop a test of spoken language focussing on the structural 
elements, while allowing contexts to vary. While many linguists and language 
instructors may judge this to be a controversial or even an incorrect 
decision, there are many instances in the literature which 
theorise that such a dimension exists. The study then sought to define the 
dimension as an example, without any claim to importance or to dominance among 
other possible dimensions. Once defined, we have attempted to confirm its 
existence and demonstrate how such a confirmatory analysis may be used. 



In this first investigation we have begun with a dimension that could loosely 
be termed 'grammatical competence'. This organisation began with a model 
proposed by Higgs and Clifford (1982). Essentially, the dimension we have 
attempted to define begins with isolated elements of vocabulary; it then moves 
on to the use of some basic formulaic language and the basic structures, 
followed by the more difficult grammatical elements. A complete list of the 
test objectives and test items can be found in Griffin et al. (1986). 



It appears that the classical true score and error model of measurement, along 
with correlational techniques such as factor analysis, has been unable to 
deal with measurement problems in language testing and language research. 
Griffin (1985) proposed the possibility of using a latent trait model, in 
particular the rating scale model (Andrich, 1978), for use with interview data 
that are scored in a number of ordered response categories. In this paper we 
apply the partial credit model (Masters, 1982), a member of the same Rasch 
family of measurement models (Wright and Masters, 1985), to the design and 
calibration of oral interview test items. Apart from allowing the analysis of 
the rating-scale-type data that result from an oral interview, the model 
provides a framework from which to begin the study of dimensions in language 
proficiency. 

The partial credit model is an extension of the simple Rasch dichotomous model 
(Rasch, 1960, 1980) that allows for the scoring of items in any number of 
ordered categories. This is one of its most obvious advantages over classical 
test theory, which normally requires dichotomously scored test items. For 
example, in the simplest case, a student's response to an interview task or 
item may be rated 0, 1 or 2, according to its degree of increasing 
acceptability and appropriateness. Any number of graded categories may be used 
but this scale was adopted for this study to illustrate the approach with the 
simplest multiple category use. Multiple categories allow for varying degrees 
of correctness rather than the totally correct and incorrect classification 
allowable with the dichotomous model. 



The partial credit model describes the relationship between a person's 
proficiency ($\beta_n$) and the difficulty of a language task ($\delta_{ij}$), 
allowing for at least two categories of performance. This relationship is 
described in terms of the probability of a person providing a language sample 
of sufficient adequacy to be given a score ($x$) out of a possible ($m_i$) for 
a particular language task. The model is written as in the formula below. 

$$
\Pr(X_{ni} = x) = \frac{\exp \sum_{j=0}^{x} (\beta_n - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\beta_n - \delta_{ij})}, \qquad x = 0, 1, \ldots, m_i
$$

There are some restrictions placed on the equation to make it mathematically 
correct, but, in general, the apparently complex equation is inherently simple 
once the notation is clarified. 

The symbol $\beta_n$ represents the proficiency (or ability) of the person. The 
symbol $\delta_{ij}$ represents the difficulty associated with scoring $j$ (in 
our example $j$ can equal 0, 1 or 2) on item $i$. The nature of the relationship 
between item difficulty and person ability as specified by the model will 
become apparent when the item characteristic curves for specific test items 
are examined below. 
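
The category probabilities can be sketched in a few lines of Python. This is an illustrative sketch only, not the CREDIT program used in the study. It assumes the usual form of the partial credit model (Masters, 1982), in which the probability of score x is proportional to the exponential of the sum of (ability minus step difficulty) over the steps up to x, with the term for a score of zero taken as zero. The function name `pcm_probabilities` is hypothetical; the step difficulties are those reported for item 4.4.

```python
import math

def pcm_probabilities(beta, deltas):
    """Return [Pr(0), Pr(1), ..., Pr(m)] for one person on one item.

    beta   : person proficiency in logits
    deltas : step difficulties [d(i,1), ..., d(i,m)] in logits
    """
    # Running sums of (beta - delta_j); the score-0 term is 0 by convention.
    exponents = [0.0]
    for delta in deltas:
        exponents.append(exponents[-1] + (beta - delta))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]

# Item 4.4: d(i,1) = -1.90, d(i,2) = 2.12, evaluated for a person at 0 logits.
probs = pcm_probabilities(0.0, [-1.90, 2.12])
print([round(p, 3) for p in probs])
```

Because each numerator is divided by the same total, the probabilities for the three score categories always sum to one.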

In applying this model it is hypothesised that language tasks based on 
grammatical structures can be ordered in difficulty from very easy to very 
difficult. Further, it is assumed that these tasks have their own inherent 
difficulty, regardless of the student group. 

If the student proficiency level is greater than the difficulty of the task 
then some degree of success (or acceptability of response) can be expected. 
Less acceptable responses can be expected if the proficiency level is less 
than the difficulty of the task. We are hypothesising that a dimension 
exists, and that it is possible to locate both persons and items on that 
dimension. If we are unable to reject the hypotheses then a comparison of 
each person to each item will give some confirmation for the dimension defined 
and yield powerful diagnostics about each student. 



By adopting the partial credit model defined in the mathematical equation, we 
are assuming that students' grammatical competence develops along the 
dimension, and that we can obtain fine interpretations of responses based on 
the simple rating scale 0, 1 or 2. 

To apply the model it is necessary to define a dimension that is to be 
measured by the test, construct items to measure on that dimension and then 
validate the test with the model. Unlike exploratory factor analysis this 
should not be a 'fishing' exercise. The model allows the hypothesis of the 
existence of a specified dimension to be explicitly tested in terms of 
'goodness of fit' (see Wright and Masters, 1982). It is not until measurable 
dimensions of this nature have been defined that it will be possible to 
examine the dimensionality of language proficiency, such that the debate about 
dimensionality can be considered as a non-issue until more confirmatory 
analyses have been completed. A more thorough discussion of these types of 
mathematical models and their potential for SLA research can be found in 
Griffin (1985). 



TEST DEVELOPMENT 

To obtain an adequate data base to undertake these analyses, an 
interview-based test was developed and administered to 270 students classified 
as having proficiency levels ranging from 0 to 1+ on the Australian Second 
Language Proficiency Ratings (Ingram, 1984). A set of 29 items was developed 
and organised into four subsets of six to eight items each. Each student was 
then administered one or two of the subsets. A description of the method of 
item construction and the nature of the items can be found in Griffin, et al 
(1985, 1986). In brief, an analysis of course content and assessment methods 
was conducted, based on observations of 60 classroom lessons, together with 
meetings and teacher workshops at various adult migrant education centres. 
Course materials were also examined and a series of objectives developed for 
validation by teachers. The objectives were placed in sequence of 
instruction, and in estimated order of difficulty for students. Then using 
Millman's (1974) formulaic approach, a set of expanded objectives was 
developed and used to generate test items. These were grouped into four 
subtests, administered to students under standardised interview conditions and 
scored using the rating scale scoring procedure outlined above. 
RESULTS



Each of the item subsets was analysed using the CREDIT computer program 
(Masters, Wright and Ludlow, 1980) and the results of these item analyses are 
shown in Table 1. Each item in Table 1 is reported with two difficulty 
parameters (the columns labelled d(i,1) and d(i,2) respectively), their 
corresponding standard errors (the columns labelled se(i,1) and se(i,2) 
respectively) and an index of item fit to the partial credit model. Two 
difficulties are reported for each item because each item is worth two score 
points. 



TABLE 1 

Item Difficulties, Standard Errors and Fit to the Partial Credit Model 

Item    d(i,1)    d(i,2)    se(i,1)    se(i,2)    fit

1.1     -2.1      -1.25     .47        .26         0.41
1.2     -1.06      0.41     .30        .26         2.27
1.3     -1.51     -1.63     .43        .30        -0.14
1.4     -0.12      0.14     .27        .27        -0.39
1.5      1.09      2.96     .26        .52         0.42
1.6     -0.29      1.41     .26        .30         0.66
1.7     -0.37      0.46     .2?        .27        -1.82
1.8     -0.75      2.51     .26        .38        -1.58
N=94

2.1     -0.86      0.90     .22        .26         0.61
2.2     -1.58      0.36     .25        .22         0.30
2.3     -0.24      0.43     .22        .25         1.69
2.4     -0.93      0.80     .22        .25        [illegible]
2.5     -1.49      2.15     .23        .36        -0.87
2.6     -1.02      1.11     .22        .27        -0.81
2.7     -0.92      1.28     .22        .28        -0.25
N=128

3.1      0.64      0.77     .21        .28         1.67
3.2     -1.58      1.96     .22        .30        -1.28
3.3     -0.39      1.24     .20        .27        -0.44
3.4     -1.48      0.17     .23        .21         1.07
3.5     -2.12      1.47     .25        .26        -0.83
3.6     -0.24      1.20     .20        .27        -1.25
3.7     -2.31      0.03     .28        .20        -0.10
3.8     -1.29      1.93     .21        .30        -0.44
N=147

4.1     -2.31      0.40     .39        .31         0.95
4.2     -1.58      0.48     .33        .32         1.39
4.4     -1.90      2.12     .33        .46        -2.01
4.5      0.01      0.56     .31        .36         0.43
4.6      1.51      1.66     .37        .60         0.66
4.7     -1.40      0.45     .32        .32        -1.29
N=70



The difficulty figures range from approximately -3.0 to +3.0 for each 
subtest. The scale is a logit scale, that is, the logarithm of the odds of 
success at each score level. The scale of measurement is interval in nature 
and can be transformed to another scale by a simple linear transformation 
(Wright and Stone, 1980). Most users of test scores prefer to remove negative 
scores; however, we will retain the basic units in our discussion. 
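
As an illustration of the logit scale and of the kind of linear transformation mentioned above, the following Python sketch converts log-odds to probabilities and rescales a measure so that negative values disappear. The slope of 10 and intercept of 50 are hypothetical choices made for this example; the study itself reports results in logits.

```python
import math

def logit_to_probability(logit):
    """Convert log-odds of success into a probability of success."""
    return 1.0 / (1.0 + math.exp(-logit))

def rescale(logit, slope=10.0, intercept=50.0):
    """Linear transformation of a logit measure.

    The slope and intercept are hypothetical; any positive slope
    preserves the interval properties of the scale.
    """
    return slope * logit + intercept

for logit in (-3.0, 0.0, 3.0):
    print(logit, round(logit_to_probability(logit), 3), rescale(logit))
```

Equal differences in logits remain equal differences after rescaling, which is what is meant by the scale being interval in nature.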

The errors of measurement represent the accuracy of each score given. Note 
first that the errors vary for each item. This is a characteristic of the Rasch 
model, in which traditional global measures of reliability are replaced by 
specific estimates of error at each point on the dimension. Note also that 
the values are small compared to the range of difficulty levels and that the 
errors are smallest in the mid range of the test. That is, each subset of 
items is most accurate about its midrange, which indicates the most appropriate 
level of administration. In ASLPR terms these would correspond to 0+, 1-, 1 
and 1+. Hence the tests give maximum accuracy at the levels for which they were 
designed. Further, when the overall person scores were correlated with ASLPR, 
a validity coefficient of 0.67 was obtained. Hence the test measures a 
dimension strongly related to that assessed by the ASLPR but gives a marked 
increase in the accuracy of measurement, and more precise diagnostic records. 



ITEM CHARACTERISTIC CURVES 

The item difficulties reported in Table 1 are best interpreted by reference to 
the item characteristic curves (ICCs). The ICCs show how the modelled 
probability of responding in a particular score category varies with ability. 
Figure 1 shows the item characteristic curve for item 4.4. In this plot there 
are three curves, one corresponding to each of the three possible scores on 
the item. The Pr(0) curve shows how the probability of scoring zero is almost 
one at low levels of ability and then decreases to zero as ability increases. 
The Pr(2) curve starts at zero for low abilities and then increases 
continuously as ability increases. The Pr(1) curve increases as ability 
increases to zero logits and then it decreases as the more able students are 
most likely to score two. The difficulties -1.90 and 2.12 for this item 
reported in Table 1 correspond to the intersection of successive probability 
curves. That is, Pr(0) and Pr(1) intersect at -1.90 and Pr(1) and Pr(2) 
intersect at 2.12. For abilities less than -1.90 a student's most likely score 
is zero. For -1.90 to 2.12 the student's most likely score is one and beyond 
2.12 it is two. 
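
The regions described for item 4.4 can be reproduced numerically. The sketch below assumes the standard partial credit form (category probabilities proportional to the exponential of cumulative sums of ability minus step difficulty); the function names are hypothetical and the ability values are arbitrary points chosen on either side of the two difficulties.

```python
import math

def pcm_probabilities(beta, deltas):
    """Category probabilities [Pr(0), ..., Pr(m)] under the partial credit model."""
    exponents = [0.0]  # score-0 term is zero by convention
    for delta in deltas:
        exponents.append(exponents[-1] + (beta - delta))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]

def most_likely_score(beta, deltas):
    """Return the score category with the highest modelled probability."""
    probs = pcm_probabilities(beta, deltas)
    return probs.index(max(probs))

ITEM_4_4 = [-1.90, 2.12]  # d(i,1), d(i,2) from Table 1

for beta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(beta, most_likely_score(beta, ITEM_4_4))
```

At an ability equal to a step difficulty the two adjacent curves cross, so evaluating `pcm_probabilities(-1.90, ITEM_4_4)` gives equal values for Pr(0) and Pr(1).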



[Figure 1: Item characteristic curve for item 4.4. Axes: Probability against Ability.]

[Figure 2: Item characteristic curves for item 1.3. Vertical axis: Probability.]



Figure 2 shows the characteristic curves for item 1.3. For this item the 
difficulty values -1.51 and -1.63 are in reverse order. This means that a 
score of one is never the most likely for any student. Students below -1.57 
are most likely to score zero and students above -1.57 are most likely to 
score two. 
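
The value -1.57 can be checked by a short derivation, assuming the standard partial credit form in which, for an item scored 0, 1 or 2, the three category probabilities are proportional to $1$, $\exp(\beta - \delta_{i1})$ and $\exp(2\beta - \delta_{i1} - \delta_{i2})$. The symbols $\beta$, $\delta_{i1}$ and $\delta_{i2}$ are used here for ability and the two reported difficulties.

```latex
\begin{align*}
\Pr(0) = \Pr(2)
  &\iff 1 = \exp(2\beta - \delta_{i1} - \delta_{i2}) \\
  &\iff \beta = \frac{\delta_{i1} + \delta_{i2}}{2}
      = \frac{-1.51 + (-1.63)}{2} = -1.57 . \\[6pt]
\Pr(1) \ge \Pr(0) &\iff \beta \ge \delta_{i1} = -1.51, \\
\Pr(1) \ge \Pr(2) &\iff \beta \le \delta_{i2} = -1.63 .
\end{align*}
```

No ability can be both at least -1.51 and at most -1.63, so under this form a score of one is never the most probable, and the zero and two curves cross midway between the two difficulties.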

Since the two difficulties for each item are not independent, the item 
characteristic curves are essential for interpreting the position of the item 
on the ability dimension. They are also useful for examining the behaviour of 
items. For example, Griffin et al. (1985) used the item characteristics to 
examine the suitability of scoring criteria for items. If the item 
characteristic curves are examined in conjunction with scoring criteria it is 
possible to examine the way different skills are mastered. For example, item 
4.4 has a wide region in which a score of one is most probable. Item 4.4 
tests the use of the future tense by using a picture showing a train about to 
fall from a collapsing bridge. The student is asked, 'What do you think will 
happen?' The scoring criteria for the item are: clear explanation and 
consistent use of tense - two points; message clear but no consistency in 
structure - one point; unintelligible or disjoint words - zero points. In 
this case the wide region for one score point indicates that many students of 
varying abilities are able to explain what is happening, but only the best 
students can use the future tense consistently and appropriately. It would 
appear that students of relatively low ability can express a sense of futurity 
but the formal structure is not mastered consistently until fairly high 
ability levels. 

In contrast to item 4.4, item 1.3 (a test of verbs) has no region where one is 
the most probable response. In this item a student is shown a set of six 
pictures with people performing various actions. The student is given an 
appropriate description of the first two pictures and is asked to describe the 
remaining four pictures. The scoring criteria are: two or more appropriate 
verbs - two points; one appropriate verb - one point; no appropriate or 
intelligible verbs - zero points. In this item any form of the verb was 
considered acceptable. The item characteristic curves show that few students 
provided only one verb. It would appear that after beginning to understand 
the use of verbs, students developed a range of verb-based vocabulary almost 
immediately. 
MODEL TO DATA FIT



Since the model is applied to test the hypothesis that a dimension of 
grammatical competence as defined by the objectives exists, an analysis of 
the model to data fit is crucial. To test the hypothesis each item and each 
person response pattern is examined to determine whether each fits the 
pattern assumed by the model. The extent of fit to the model is summarised 
by a 'fit' statistic. This is discussed in full in Wright and Masters (1982). 

The fit statistic for persons and items that conform to the model has an 
expected value of zero and a standard deviation of about one. The fit of each 
of the items is shown in the last column of Table 1. When the fit statistic 
exceeds two, or is less than negative two, there is reason to doubt that 
this item works in the same way as the other items in the test. Hence it is 
likely that these items do not measure proficiency on the same 
dimension. When a large number of items are found to misfit, it is doubtful 
that the set of items is working together to define a measurable dimension, 
and we would reject the hypothesis of a dimension. Previous research has 
shown that positive fit (in excess of positive two) usually occurs when an 
item's score categories do not discriminate between low and high performers as 
clearly as other items in the test. Negative fit (less than negative two) 
usually occurs when an item discriminates more highly than other items in the 
test. 
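
The decision rule described above can be sketched as a simple screening step. The fit values below are transcribed from the last column of Table 1; item 2.4 is left out because its value cannot be read in the source. The function name `flag_misfit` and the dictionary layout are hypothetical and are not part of the CREDIT program.

```python
# Fit statistics from the last column of Table 1.
# Item 2.4 is omitted: its fit value is illegible in the source.
ITEM_FIT = {
    "1.1": 0.41, "1.2": 2.27, "1.3": -0.14, "1.4": -0.39,
    "1.5": 0.42, "1.6": 0.66, "1.7": -1.82, "1.8": -1.58,
    "2.1": 0.61, "2.2": 0.30, "2.3": 1.69, "2.5": -0.87,
    "2.6": -0.81, "2.7": -0.25,
    "3.1": 1.67, "3.2": -1.28, "3.3": -0.44, "3.4": 1.07,
    "3.5": -0.83, "3.6": -1.25, "3.7": -0.10, "3.8": -0.44,
    "4.1": 0.95, "4.2": 1.39, "4.4": -2.01, "4.5": 0.43,
    "4.6": 0.66, "4.7": -1.29,
}

def flag_misfit(fit_by_item, limit=2.0):
    """Return (positive_misfit, negative_misfit) item labels.

    Positive misfit: fit above +limit (weaker discrimination than the rest).
    Negative misfit: fit below -limit (sharper discrimination than the rest).
    """
    positive = sorted(k for k, v in fit_by_item.items() if v > limit)
    negative = sorted(k for k, v in fit_by_item.items() if v < -limit)
    return positive, negative

positive, negative = flag_misfit(ITEM_FIT)
print("positive misfit:", positive)
print("negative misfit:", negative)
```

Separating the two directions of misfit matters because, as noted above, they usually have different causes.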



In the analyses reported in Table 1 only items 1.2 and 4.4 were found to have 
fit statistics outside the range -2.0 to +2.0. Hence it appears that each 
subset is made up of items that are working together. Consequently, 
we cannot reject the hypothesis that such items can be 
ordered along a measurable dimension. To investigate possible causes of the 
misfit of items 1.2 and 4.4, the students' scores on these items were plotted 
against their proficiency, or ability, as measured by the subset 
which contained the misfitting item. The misfit of item 1.2, shown in 
Figure 3, is probably due to the performance of the two students marked. 
Both of these students have scored unity on the item, while on the basis of 
their overall performance on all the items in the subset, they would be 
expected to have scored a two. On examination of the recorded interviews for 
these students it was found that student 1 was probably scored incorrectly. 
The item asked the student to describe two persons obviously feeling hot and 
cold, in that order. This student was quite confident and quickly responded 
'cold and hot' rather than 'hot and cold', then immediately self-corrected. 
The second student responded 'He is grin' in response to a question where the 
more appropriate response was 'He is happy'. This vocabulary difficulty was 
rated down while according to other responses we would expect this student to 
be able to respond in a fully acceptable manner. 



[Figure 3: caption illegible in source. Axes: Item Score against Ability (-4 to 4); one point is labelled Student 1.]

[Figure 4: caption not recoverable. Horizontal axis: Ability (-4 to 4).]
Figure 4 shows that item 4.4 has good discrimination. The reason for the 
misfit of item 4.4 is most likely that it discriminates more 
highly than the other items in the subset. In traditional test analysis 
procedures this item would probably be regarded as the best, but this analysis 
suggests that its performance should be monitored more closely in future 
because it is not behaving in the same manner as other items in the test. 

Normally, in the case of misfit, items should be deleted from further 
analysis. However, the misfit of these two items is not extreme and does not 
appear to be due to flaws in the items. They were retained in the subsets for 
that reason. 

In addition to examining the item misfit it is also important to examine any 
student misfit. As with item fit, the student fit has an expected value of 
zero and a standard deviation of about unity when the data conform to the 
model. In Figure 5 below, a bar chart shows the distribution of the student 
fit statistics. The distribution of fit as shown would appear to support the 
previous evidence of a strong fit of the model to the data. In particular, 
note that of the 439 points plotted in Figure 5 only 30 (6.8 percent) lie 
outside the range -2.0 to 2.0. The response patterns of the 13 students with 
fit greater than 2.0 do not conform to the model and indicate that for these 
students the dimension may not be defined in the same way as for the other 
students. An intensive follow-up of these students is likely to lead to 
useful diagnostic information about their particular strengths and weaknesses. 
The fit of less than -2.0 for 17 of the students is due to 
their response patterns being too orderly; these, as with high negative item 
fit, are rarely considered a real problem but should nonetheless be 
monitored. 
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Both the student and Item fit statistics support the hypothesis of a single 
dimension In the data. From a statistical perspective the model has been 
unable to reject the hypothesis that each of the item subsets can be measured 
on a single dimension, as defined by the test objectives. 
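
The tally of student fit statistics reported above can be reproduced with a few lines of code. This is a minimal sketch only: the function name is hypothetical and the list of fit values is a placeholder constructed to match the reported counts (17 below -2.0, 13 above 2.0, 439 in total), not the study data.

```python
def summarize_fit(fit_values, lower=-2.0, upper=2.0):
    """Count fit statistics below, above, and outside the range [lower, upper]."""
    if not fit_values:
        raise ValueError("fit_values must not be empty")
    below = sum(1 for t in fit_values if t < lower)
    above = sum(1 for t in fit_values if t > upper)
    outside = below + above
    return {
        "total": len(fit_values),
        "below": below,
        "above": above,
        "outside": outside,
        "percent_outside": 100.0 * outside / len(fit_values),
    }

# Placeholder values chosen only to match the counts reported in the text;
# they are not the observed fit statistics.
fit_values = [-2.5] * 17 + [2.5] * 13 + [0.0] * 409
summary = summarize_fit(fit_values)
print(summary)
```

Keeping the two tails separate matters for interpretation, since values above 2.0 flag response patterns that do not conform to the model while values below -2.0 flag patterns that are too orderly.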



SUMMARY AND CONCLUSIONS



There are several conclusions which might be drawn from this study. First, it
is possible to generate tests of spoken language which are based on generic
formulae and which may be scored using a simple rating scale. Second, this
rating scale can be defined according to criteria which are robust to
variations in scores. This robustness then allows the data to be scaled using
the Rasch Rating Scale model. The latent trait approach has been shown to be
successful in defining a grammatical or structural dimension. Further work has
been done in demonstrating that the four subjects all measure the same
grammatical dimension and this has been reported elsewhere (Griffin et al.
1986). Detailed instructions for administering and interpreting the test are
given elsewhere as well (Griffin 1986).
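
For reference, the family of models referred to here can be written in the form usually attributed to Wright and Masters (1982). The notation below is an assumption made for illustration and may differ from the symbols used elsewhere in this paper: \beta_n is the ability of student n, \delta_{ij} is the difficulty of step j of item i, and m_i is the maximum score on item i.

```latex
P(X_{ni} = x) =
  \frac{\exp \sum_{j=0}^{x} (\beta_n - \delta_{ij})}
       {\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\beta_n - \delta_{ij})},
  \qquad x = 0, 1, \ldots, m_i,
  \qquad \text{with } \sum_{j=0}^{0} (\beta_n - \delta_{ij}) \equiv 0 .

% Rating scale model as a special case: every item shares the same
% step structure \tau_j around its own location \delta_i.
\delta_{ij} = \delta_i + \tau_j
```

Under the rating scale restriction the items differ only in location, which is consistent with scoring every item on the same simple rating scale.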

The third conclusion from the study is that we are able to use the Rasch model
as a confirmatory approach to dimensionality. It has not been possible to
examine the possibility of constructing a test based on other proposed
dimensions or developmental sequences. However, the procedures employed above
could and should be applied to the development and validation of tests
hypothesised to test other dimensions, and implicational or developmental
sequences.

It is important to note that the identification and confirmation of this
dimension is based on first attempting to construct it and then testing whether
the constructed variable fits the properties that a measurable dimension
should have. It does not rely on the assumed models and correlational
techniques discussed in the Introduction of this paper.

Given the success of fitting the model to the data it is possible to argue
that a probabilistic approach to the measurement of language development may be
adopted for each defensible, definable and demonstrable variable or dimension
associated with SLA. This would include Johnson and Pieneman's (1985)
implicational dimension, but it is likely that their variational dimension is
really "lack of fit" (or residuals associated with lack of fit) to their
overall implicational dimension.



Given the probabilistic approach of the model it has also been demonstrated
that Item Characteristic Curves (ICCs), item discrimination, item fit, person
fit and other tests of the data each contribute information about language
performance. All of this adds to the information about language development
and assists in providing profile development and monitoring techniques.

The end result is a measurable dimension of language development or
proficiency. Small changes in development may be defined with greater
accuracy than was previously available. Further, these fine changes in
proficiency may be translated into specific skill gains and into the next most
likely skill gain.

It is also important to note that in this brief discussion it has not been
possible to examine the uses of the test beyond validating the existence of
dimensions. Using tests developed with the model to monitor individual
student progress, together with the use of fit statistics for individual
diagnosis, is quite powerful and adds to the advantages of applying the model
to data of this type.

The application of a dimension-based model that can be used with data gathered
from oral interviews clearly has a lot to offer measurement in the language
area. If measures are developed from substantive theoretical perspectives and
validated via a dimension-based measurement model, they can then be used in
confirmatory or exploratory analyses that are powerful tools in the
examination of relationships between the dimensions that have been defined and
measured.



1. The CREDIT program uses the unconditional maximum likelihood procedure
described in Wright and Masters (1982) to jointly estimate the student
abilities and item difficulties.